Label-guided seed-chain-extend alignment on annotated De Bruijn graphs

Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. Availability and implementation The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.

Algorithm 1 connect(•, •) function used by SCA to calculate the score for connecting two anchors with the label ℓ.
1: end if 6: end for 7: return ∆= • min {d Q , l L } + ∆ G (g best ) Algorithm 2 connect(•, •) function used by MLC to calculate the score for connecting two anchors of length l.
Input: Anchors α(ij , lj , vj , ℓj ) and α(i L , l L , v L , ℓ L ) that originate from alignment ax to label ℓj and ay to label ℓ L , respectively, such that ij < i L .A deletion open score ∆ DO < 0. Gap penalty function ∆ G .Output: A score s ∈ Z.

Merging single-label anchors into MUMs
We define a unipath as a maximal non-branching path in a DBG whose spelling is called a unitig.In the initial anchor set, all anchors are of the same length l.Then, given two consecutive anchors α(i1, l1, v1) and α(i2, l2, v2) (since we may be merging into a previously merged anchor), if i1 + l + 1 = i2 + l2, if these are the only anchors with end positions i1 + l and i2 + l2, if (v1, v2) ∈ E, and if v1 and v2 are in the same unipath, we merge them into a single anchor of length l2 + 1.

Probabilistic graphical alignment models
Given a starting node in the graph, the first generative model (called the target model) generates query sequences with probabilities proportional to their similarity to the spellings of walks that originate at the starting node, assuming that the query sequence is generated by a series of edits to a walk spelling.The second model (called the null model) generates both query sequences and graph walks with probabilities only proportional to the sequences' respective lengths, assuming that the two tasks occur independently.Each alignment operation E (defined in Main Section 2.1) corresponds to traversing an edge in the graphical model (called a transition), which occurs with a transition probability denoted by Pr(E) for the target model and Pr0(E) for the null model.The graphical model emits a query character if E ∈ {=, ̸ =, IO, IE} and it traverses a graph edge to emit a target character if E ∈ {=, ̸ =, DO, DE}.We define the score of E as where λ E is a user-set scaling constant.Thus, the sum of all operation scores (i.e., the alignment score ∆ S ) corresponds to the log-likelihood ratio of the corresponding walk probabilities in the target and null models.

Inferring taxonomic ranks from WGSUniFrac values
When computing WGSUniFrac errors, the algorithm assigns a branch length of x −1 , where x is the distance from the branch to the tree root.We consider nine different ranks: superkingdom, phylum, class, order, family, genus, species, strain, accession, with corresponding branch lengths 1 2 , 1 3 , 1 4 , 1 5 , 1 6 , 1 7 , 1 8 , 1 9 .If we assume that the taxonomic profile of a read contains a single accession (and by definition, the ground-truth profile has a single accession), then the unweighted WGSUniFrac error for accuracy at level x (meaning a mismatch at rank x+1) is 2 • 9 i=x+1 1 i since there is an incorrect branch in both profiles.Since the maximum WGSUniFrac error value is ∼5.66, each unweighted error is divided by this constant.Based on this interpretation, we calculate midpoints for WGSUniFrac error and accuracy value ranges corresponding to classification accuracy at least taxonomic rank in Table 1.

Motivations for alignment parameters
Our value of ∆ J := ∆ DE is motivated by the fact that we use node length changes as a proxy for inserting nodes into the graph.Thus, we score each unit change in node length with the same penalty as a character insertion into the reference (i.e., a deletion from the query).We choose a linear penalty model instead of affine because the probability of a false anchor chaining grows exponentially as the node length difference changes (unlike the argument used to support affine penalties in traditional alignment gap scoring that the length of an indel is less important than the event of an indel being opened).
The choice of the chain density parameter ρ is meant to provide a sufficiently rich number of single-label alignments to MLC while not incurring a high anchor extension cost (Tab.2).For Illumina, HiFi, and CLR reads, a value of 0.01 leads to the greatest mean query coverage, while for ONT reads, the ρ values leading to the top two mean coverages come with a significant alignment time cost.
The chaining parameter b2 := 4 is the same value used by the 2022 co-linear chaining work by Jain et al. (2022) on which our algorithm was based.Due to our anchor filtration in MLC (Main §3.5), we expect a larger distance between anchors since many anchors that occur in only one single-label alignment are considered to be redundant and discarded.Thus, we use b := 400 instead of 100 to avoid one iteration of the while loop in Main Algorithm 1.

Graph construction and indexing
For the following operations, we report times for execution on a single thread.Cleaning all input read files took 190m50.05s.From the cleaned sequence sets, assembling all walk covers took 46m58.23s.Constructing the BOSS representation of the joint graph from the walk covers took 2m58s.Afterwards, annotating the joint graph took 33m1.65s.Finally, constructing the HyperLogLog++ sketches of the annotation took 8.16s.We note that the input cleaning, walk cover construction, and annotation of the joint graph can be run concurrently in a distributed fashion on multiple compute nodes.Joint graph construction can run with multiple threads.Converting the graph to the GFA format for GraphAligner took 2m50s.Construction of the PLAST index from the walk covers took 19m47s.4. Supplementary Tables Table 3. Evaluation measures for the alignments produced by each tool.Values between brackets represent 95% confidence intervals from 1000 bootstrap samples.

Fig. 1 :
Fig. 1: Number of k-mers (left) and walk cover sizes (right) in the simulated assembly graphs..

Table 1 .
Midpoint WGSUniFrac error and accuracy values for matching accuracy at each taxonomic rank.UniFrac accuracy is 1 − WGSUniFrac error.

Table 2 .
Parameter sweep of the chain density parameter ρ for MLA (SCA+LC+NC).Mean total coverage is the mean percentage of query characters that are covered by at least one alignment.

Table 4 .
Breakdown of execution times (in seconds) for each component of the alignment workflow.